ABSTRACT

The reliability of four measures of written
expression was examined (total words written, mature words, words
spelled correctly, and letters in sequence). Subjects included
elementary-age students in several school districts, some of whom 
were learning disabled. Results revealed high coefficients for 
test-retest reliability, parallel-form reliability, split-half 
reliability, and interscorer reliability. Further, the reliability 
coefficients for total words, words spelled correctly, and letters in 
sequence were consistently superior, demonstrating significant 
precision in measurement. Two implications are drawn from the 
research: (1) high reliability estimates of the measures of written 
expression provide a necessary basis for the determination of their 
validity; and (2) the research assures teachers and educational 
professionals using formative evaluation measures that such 
procedures are accurate and stable. (Author/GK) 

Abstract

The reliability of four measures of written expression (Total
Words Written, Mature Words, Words Spelled Correctly, and Letters in
Correct Sequence) was examined. Subjects included elementary-age
students in several school districts, some of whom were learning
disabled. Results revealed high coefficients for test-retest
reliability, parallel-form reliability, split-half reliability, and
interscorer reliability.



The Reliability of Simple, Direct Measures 
of Written Expression 

Formative evaluation systems utilizing simple, direct measures 
of academic performance may be employed as an alternative to tradi- 
tional methods of assessing the needs of learning disabled students 
(Crutcher & Hoffmeister, 1975; Lovitt, Schaff, & Sayre, 1970; Mirkin 
& Deno, 1979). Within the formative evaluation framework, student 
academic performance is monitored on a frequent basis. Analysis of 
these time-series data should aid in the diagnosis and prescription of 
effective programs for students receiving learning disability services. 

However, for formative evaluation systems to be implemented in the 
classroom, a clear description of the academic behaviors to be measured 
must be established. Because such measures are to be used in making
decisions that influence educational programming, they must meet the
standards set for psychological and educational tests (American
Psychological Association, 1974; Salvia & Ysseldyke, 1978). Most important
among these standards are the technical characteristics of a measure's
validity and reliability (Thorndike & Hagen, 1977).

In the formative evaluation of written expression, several behavior- 
al measures of writing performance are suggested as valid (Deno, Marston, 
& Mirkin, 1980); however, the reliability of these measures has yet to 
be determined. 

Nunnally (1978) states that "measurements are reliable to the extent 
that they are repeatable and that any random influence which tends to 
make measurements different from occasion to occasion. . .is a source of 
measurement error" (p. 225). Thus, reliability is an index of the
accuracy and stability of a measure. In the American Psychological
Association's Standards for Educational and Psychological Tests (1974), 
four types of reliability are outlined: comparisons over time; com- 
parability of forms; internal consistency; and administration and 
scoring. 

Comparisons Over Time 

Comparisons over time usually take the form of test-retest
reliability. The emphasis in this type of reliability is on the stability
of the scores derived from a particular measure. Ideally, the score
obtained today will be quite similar to the score obtained a week
later. This approach, however, presents a significant problem in that
some students may improve during the test-retest interval, thus
suppressing the reliability coefficient.
Comparability of Forms

Often referred to as parallel-form reliability (Nunnally, 1978),
this type of reliability analysis avoids the problems of learning
and memory effects. Thorndike and Hagen (1977) suggest that variation
of student scores on a measure also may be due to a biased sampling
of tasks or items chosen for that measure. If a measure is
inadequately constructed or if the sampling is indeed in error, then
performance on the measure may not truly reflect the student's actual
skills. As a result, we may not make generalizations about student
performance. For example, if the number of words written in a
five-minute composition is not reliable, it quite likely does not truly
index the student's skills in written expression. Thorndike and Hagen
(1977) suggest that equivalent (parallel) forms of the test be
produced and correlated. This suggestion fits well into the formative
evaluation scheme because such a system requires several parallel
measures of written expression (see Deno et al., 1980). Parallel-test
reliability is then quite important to substantiating the
reliability of all the behavioral measures of written expression.
Internal Consistency 

Internal consistency measures the average correlation among all
of the items included in a test (Nunnally, 1978). The more reliable
a measure is, the higher the pattern of intercorrelations.

One approach to indexing internal consistency is the split-half
reliability estimate (Salvia & Ysseldyke, 1978). In this approach
the items of a test usually are randomly assigned to two equal-length
tests. The correlation between these two tests is an estimate of the
measure's reliability. In the procedures for measuring written
expression, a set of items does not exist. However, by dividing the
five-minute written compositions into one-minute units, one may determine
the internal consistency reliability of the formative measures by
treating each one-minute writing sample as an item. Assuming the
student needs the first minute to warm up, a split-half analysis would
focus on minutes 2, 3, 4, and 5.
Administration and Scoring 

A fourth area of concern is the reliability of administration and
scoring procedures. Because the reliability of scoring is crucial to
the formative evaluation of written expression, interscorer agreement
is analyzed. According to the Standards (American Psychological
Association, 1974), this type of reliability must be quite high.

This paper focuses on those types of reliability that make a
significant contribution to the technical adequacy of the simple,
direct measures of written expression. Specifically, data will be
presented with respect to comparisons over time, comparability of
forms, internal consistency, and interscorer reliability for four
direct measures of written expression (Total Words Written, Mature
Words, Words Spelled Correctly, and Letters in Correct Sequence;
cf. Deno et al., 1980).

Method 

Comparisons Over Time 

Subjects. Twenty-eight learning disabled students attending a
summer school program in a Minneapolis elementary school served as
subjects for this study. Their ages ranged from 6 years, 5 months to
12 years, 2 months, with an average age of 10 years.

Procedure. Each student was administered two identical story
starters, three weeks apart. The length of time for each administration
was five minutes. On each occasion, the student's scores for Total
Words Written, Mature Words, Words Spelled Correctly, and Letters in
Correct Sequence were tabulated. Test-retest reliability for the
three-week period was then determined by computing a Pearson Product-Moment
Correlation for each measure between the two administrations.
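
The test-retest computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure as implemented: the function name `pearson_r` and the five-student score lists are hypothetical and are not study data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical Total Words Written scores for five students (not study data).
first_administration = [12, 25, 31, 18, 40]
second_administration = [15, 27, 29, 22, 43]

# One coefficient per measure; here only a single measure is shown.
test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 2))
```

The same call would be repeated for each of the four measures, pairing each student's score from the first administration with the score from the second.
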
Comparability of Forms

Subjects. Subjects were 161 elementary students selected randomly
from two urban midwestern cities.

Procedure. To determine the parallel-form reliability of
Total Words Written, Words Spelled Correctly, and Letters in Correct
Sequence, the written compositions of the subjects were scored on
these measures. Each child completed two comparable Story Starters.
For the first Story Starter, the child was asked to "Write a story
about the night you were camping in the woods and you heard strange
noises outside your tent." In the second Story Starter condition, the
student was asked to "Pretend you are stranded on a tropical island.
Write a story about what happens to you." In each situation, the child
was given five minutes to write a story.

Pearson Product-Moment Correlations were computed between the two
compositions to determine the parallel-form reliability.
Internal Consistency 

Subjects. Subjects were 105 elementary students in grades 1
through 6. They were selected randomly from six schools in a large
midwestern city.

Procedure. To determine the split-half reliability of the direct
measures of written expression, the written compositions of subjects
were examined to determine how far each student had written at the end
of minutes 1, 2, 3, 4, and 5. Total Words Written, the number of
Mature Words Written, total Words Spelled Correctly, and the total
number of Letters in Correct Sequence were then computed for each
one-minute unit. The number of words written for minutes 2 and 5 was then
totaled, as were the results for minutes 3 and 4. These two sums were
then correlated. Performance during minutes 2 and 4 was totaled and,
similarly, minutes 3 and 5. Again, these two sums were correlated.

A third approach to demonstrating the internal consistency of the
five-minute writing sample for the four measures employed Cronbach's
Alpha (Cronbach, 1951). With this method the students' output for
each minute was compared for consistency.
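
As an illustration of treating each one-minute unit as an item, the sketch below computes Coefficient Alpha with the standard variance-based formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The paper does not state which computational form was used, so this formula, the function name `cronbach_alpha`, and the sample counts are assumptions.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Coefficient alpha for a students-by-items score table.

    item_scores: one list per student, one entry per item
    (here, one entry per one-minute unit of the writing sample).
    """
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two students")
    # Sample variance of each item (column) across students.
    item_vars = [variance([student[i] for student in item_scores]) for i in range(k)]
    # Sample variance of each student's total score.
    total_var = variance([sum(student) for student in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical words written in each of five minutes by six students (not study data).
words_per_minute = [
    [3, 5, 6, 5, 4],
    [6, 9, 10, 9, 8],
    [2, 3, 3, 4, 3],
    [8, 12, 11, 13, 12],
    [4, 6, 7, 6, 6],
    [5, 8, 8, 7, 9],
]
alpha = cronbach_alpha(words_per_minute)
print(round(alpha, 2))
```
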
Administration and Scoring 

Subjects. Subjects were 20 students from an elementary school
in a large city in central Pennsylvania. Students were enrolled in
grades 1 through 6.

Procedure. Interscorer reliability for Total Words Written, Mature
Words, Words Spelled Correctly, and Letters in Correct Sequence was
determined by correlating the scored results of four judges trained at
the Institute for Research on Learning Disabilities. Twenty compositions
written by elementary students were scored by each judge. Each judge
was "blind" to the scores of the other judges. Interscorer agreement
was then calculated in pair-wise fashion, producing a range of
correlations that included six coefficients for each measure of written
expression.

Results and Discussion 
Comparisons Over Time

The test-retest coefficients for a one-day interval and a three-week
interval are presented in Table 1. For the one-day test-retest
period the correlations ranged from .57 to .92. The range of test-retest
correlations for the three-week interval was .50 to .70.



Insert Table 1 about here 

It is likely that the reliabilities for the three-week interval
were attenuated because of the intervening learning experiences of
the 28 students, all of whom had practiced writing daily. Another
factor suppressing the correlations might be the restricted
range of the sample (Nunnally, 1978); all students were learning
disabled.

The test-retest coefficients suggested that the best estimates
of reliability are found in the Total Words Written, Words Spelled
Correctly, and Letters in Correct Sequence measures.
Comparability of Forms

Parallel-test correlations were high. The parallel-test correlation
coefficient for Total Words Written was .95, for Words Spelled
Correctly was .95, and for Letters in Correct Sequence was .96. These
highly reliable coefficients suggest that teachers may confidently use
a series of story starters in the frequent measurement of written
expression.

In addition to using different Story Starters in a formative
evaluation system, Deno et al. (1980) also suggest employing other procedures
to help students write compositions. One of these methods is the Topic
Sentence, usually a brief sentence asking the child to write a topical
composition. An example is, "Write about what you will do during
summer vacation." Another alternative is the use of picture stimuli. In
these situations, the student is asked to write a story about a picture
that is presented.

In a sense, correlations among a student's written performance on
these stimulus approaches are a form of parallel-test reliability, and
the Pearson Product-Moment Correlations should be high. The correlations
between these various approaches are presented in Table 2. As can be
seen, the reliability coefficients for Total Words Written and Words
Spelled Correctly were high, ranging from .79 to .87. The reliability
of Mature Words was lower, ranging from .74 to .79.

Insert Table 2 about here 

Internal Consistency 

Table 3 presents the split-half reliabilities for Total Words 
Written, Mature Words, Words Spelled Correctly, and Letters in Correct 
Sequence where the scores for minutes 2 and 4 were combined and cor- 
related with the combined total of minutes 3 and 5. Also included 
are the split-half reliabilities for minutes 2 and 5 combined compared 
to minutes 3 and 4 totaled. As may be seen, the correlations ranged 
from .96 to .99 for the split-half reliabilities. 

Insert Table 3 about here

Cronbach (1951) created a more generalizable method for determining 
internal consistency. His approach, called Coefficient Alpha, is the 
average split-half correlation based on all possible divisions of a 
test into two parts. Using each one-minute unit in a five-minute 
written sample as an item, Coefficient Alpha was calculated for Total 
Words Written, Mature Words, Words Spelled Correctly, and Letters in 
Correct Sequence. These reliability estimates also are presented in
Table 3, and ranged from .70 to .87.

In all, the internal consistency reliability of the direct measures 
of written expression was satisfactory. 




Administration and Scoring 

The range of inter-judge reliability coefficients for Total 
Words Written, Mature Words, Words Spelled Correctly, and Letters in 
Correct Sequence is presented in Table 4. Also presented is the mean 
reliability coefficient. All reliabilities were extremely high, with 
coefficients of .98 and better for Total Words Written, Words Spelled
Correctly, and Letters in Correct Sequence. Inter-judge correlations 
for Mature Words ranged from .90 to .94. 

Insert Table 4 about here

Again, these reliabilities are sufficiently high to assure con- 
fidence in the use of the various measures. 

Conclusions 

The reliability of four formative measures of written expression 
was investigated in this paper. With respect to comparisons over time,
comparability of forms, internal consistency, and interscorer
reliability, all measures appeared to meet the professional standards set
for reliability. Further, the reliability coefficients for Total Words,
Words Spelled Correctly, and Letters in Sequence were consistently
superior, demonstrating significant precision in measurement.

Two implications may be drawn from the research. First, the high 
reliability estimates of the measures of written expression provide 
a necessary basis for the determination of their validity. Thorndike 
and Hagen (1977) noted that "the ceiling for the possible validity of 
a test is set by its reliability" (p. 87). The high coefficients 
reported here document the opportunity for establishing substantial
validity for the formative measures of written expression.
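
One common way to make the ceiling idea concrete is the classical test theory bound, under which an observed validity coefficient cannot exceed the square root of the product of the two measures' reliabilities. That bound is not stated in this paper; it is brought in here as an assumption, and the function `validity_ceiling` is hypothetical. The reliabilities used are the three-week test-retest coefficients from Table 1, with the criterion assumed to be perfectly reliable.

```python
from math import sqrt

def validity_ceiling(reliability, criterion_reliability=1.0):
    """Upper bound on an observed validity coefficient under classical test theory."""
    if not (0.0 <= reliability <= 1.0 and 0.0 <= criterion_reliability <= 1.0):
        raise ValueError("reliabilities must lie between 0 and 1")
    return sqrt(reliability * criterion_reliability)

# Three-week test-retest coefficients reported in Table 1.
three_week = {
    "Mature Words": 0.50,
    "Total Words Written": 0.64,
    "Words Spelled Correctly": 0.62,
    "Letters in Correct Sequence": 0.70,
}
for measure, r in three_week.items():
    print(f"{measure}: ceiling = {validity_ceiling(r):.2f}")
```
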

Second, the research assures teachers and other educational pro- 
fessionals using formative evaluation measures that these procedures 
are accurate and stable. Indeed, the educational professional may 
feel quite confident about the precision of formative evaluation measure- 
ment of written expression in the classroom. 

Table 1

Test-Retest Reliability Coefficients for Four Simple,
Direct Measures of Written Expression

                                      Words       Letters in
              Mature      Total       Spelled     Correct
Interval      Words       Words       Correctly   Sequence

1 day          .57         .91         .81         .92
3 weeks        .50         .64         .62         .70

Table 2

Correlations Between Writing Stimulus Formats on
Direct Measures of Written Expression

                           Story Starter     Story Starter       Topic Sentence
Dependent                  and               and                 and
Measure                    Topic Sentence    Picture Stimulus    Picture Stimulus

Mature Words                    .75               .79                 .74
Words Spelled Correctly         .81               .87                 .86
Total Words Written             .79               .86                 .85

Table 3

Internal Consistency Reliabilities for Four Direct
Measures of Written Expression

                        Total      Mature     Words       Letters in
                        Words      Words      Spelled     Correct
                        Written    Written    Correctly   Sequence

Minutes 2 & 5            .99        .98        .96         .98
  vs. 3 & 4
Minutes 2 & 4            .99        .98        .97         .99
  vs. 3 & 5
Cronbach's Alpha         .87        .74        .70         .87

All correlations significant at the .001 level.




Table 4

Interscorer Reliability Coefficients for Four Trained Judges
Scoring Direct Measures of Written Expression

                                Range of Inter-Judge        Mean Inter-Judge
                                Reliability Coefficients    Reliability

Total Words Written                  .98 - .99                   .98
Mature Words                         .90 - .94                   .92
Words Spelled Correctly              .98 - .99                   .98
Letters in Correct Sequence          .98 - .99                   .99

All correlations significant at .001